EXAMPLE 1. Q1 "Faloutsos" using a W-KwS (Google)
Christos Faloutsos 

SCS CSD Professor's affiliatons, research, projects, publications and 
teaching. 

www.cs.cmu.edu/~christos/ - 9k 
Michalis Faloutsos 

The Homepage of Michalis Faloutsos ... Interesting and Miscallaneous 
Links • Fun pictures • Other Faloutsos on the web; The Teach-To-Learn 
Initiative: 

www.cs.ucr.edu/~michalis/ - 5k 
Petros Faloutsos 

Courses • Press Coverage • Publications • Research Highlights • Awards • 
MAGIX Lab • Curriculum Vitae • Family • Other Faloutsos on Web. 

www.cs.ucla.edu/~pfal/ - 4k 



ABSTRACT 

A previously proposed keyword search paradigm produces, as a 
query result, a ranked list of Object Summaries (OSs). An OS is 
a tree structure of related tuples that summarizes all data held in a 
relational database about a particular Data Subject (DS). However, 
some of these OSs are very large in size and therefore unfriendly 
to users that initially prefer synoptic information before proceeding 
to more comprehensive information about a particular DS. In this 
paper, we investigate the effective and efficient retrieval of concise 
and informative OSs. We argue that a good size-l OS should be a
stand-alone and meaningful synopsis of the most important
information about the particular DS. More precisely, we define a size-l
OS as a partial OS composed of l important tuples. We propose
three algorithms for the efficient generation of size-l OSs (in
addition to the optimal approach, which requires exponential time).
Experimental evaluation on DBLP and TPC-H databases verifies 
the effectiveness and efficiency of our approach. 

1. INTRODUCTION 

Web Keyword Search (W-KwS) has been very successful be- 
cause it allows users to extract effectively and efficiently useful 
information from the web using only a set of keywords. For
instance, Example 1 illustrates the partial result of a W-KwS (e.g.
Google) for Q1: "Faloutsos": a ranked set (with only the first three
results shown) of links to web pages containing the keyword(s).
We observe that each result is accompanied with a snippet [21], 
i.e. a short summary that sometimes even includes the complete 
answer to the query (if, for example, the user is only interested in 
whether Christos Faloutsos is a Professor or whether his brothers 
are academics). 

The success of the W-KwS paradigm has encouraged the emergence
of the keyword search paradigm in relational databases (R-KwS)
[2, 4, 13]. The R-KwS paradigm is used to find tuples that
contain the keywords and their relationships through foreign-key
links, e.g. query Q2: "Faloutsos"+"Agrawal" returns Authors
Faloutsos and Agrawal and their associations through co-authored
papers. Example 2 illustrates the result of a traditional R-KwS for Q2
on the DBLP database. On the other hand, the R-KwS paradigm
may not be very effective when trying to extract information about
a particular data subject (DS), e.g. "Faloutsos" in Q1. Example 3
illustrates the R-KwS result for Q1, namely a ranked set of Author
tuples containing the Faloutsos keyword, which are the Author
tuples corresponding to the three brothers. Evidently, this result fails
to provide comprehensive information to users about the Faloutsos
brothers, e.g. a complete list of their publications and other
corresponding details. (Certainly, the R-KwS paradigm remains very
useful when trying to combine keywords.)

EXAMPLE 2. Q2 using an R-KwS (searching DBLP database)
Author: Christos Faloutsos, Paper: Efficient similarity search in
sequence databases, Author: Rakesh Agrawal.

Author: Christos Faloutsos, Paper: Method for high-dimensionality
indexing in a multi-media database, Author: Rakesh Agrawal.

Author: Christos Faloutsos, Paper: Quest: A project on database mining,
Author: Rakesh Agrawal.

EXAMPLE 3. Q1 using an R-KwS (searching DBLP database)
Author: Christos Faloutsos
Author: Michalis Faloutsos
Author: Petros Faloutsos

In [8], the concept of object summary (OS) is introduced; an 
OS summarizes all data held in a database about a particular DS. 
More precisely, an OS is a tree with the tuple t^DS containing the
keyword (e.g. Author tuple Christos Faloutsos) as the root node and
its neighboring tuples, containing additional information (e.g. his
papers, co-authors etc.), as child nodes. The result for Q1 is in fact a
set of OSs: one per DS that includes all data held in the database for
each Faloutsos brother. Example 4 illustrates the OS for Christos
(the complete set of papers and the OSs of the other two brothers
were omitted due to lack of space). This result evidently provides
a more complete set of information per brother.

EXAMPLE 4. The OS for Christos Faloutsos
Author: Christos Faloutsos 

Paper: On Power-law Relationships of the Internet Topology.

Conference:SIGCOMM. Year:1999. 

Co-Author(s):Michalis Faloutsos, Petros Faloutsos. 
Paper: An Efficient Pictorial Database System for PSQL. 

Conference:IEEE Trans. Software Eng. Year:1988. 

Co-Author(s):N. Roussopoulos, T. Sellis. 



Paper: Declustering Using Fractals. 

Conference:PDIS. Year:1993. Co-Author(s):Pravin Bhagwat. 
Paper: Declustering Using Error Correcting Codes. 

Conference:PODS. Year:1989. Co-Author(s):Dimitris N. Metaxas. 
(Total 1,309 tuples) 



EXAMPLE 5. The size-l OSs for Q1 and l=15

Author: Christos Faloutsos
Paper: On Power-law Relationships of the Internet Topology.
Conference:SIGCOMM. Year:1999. 
Co-Author(s):Michalis Faloutsos, Petros Faloutsos. 
Paper: The QBIC Project: Querying Images by Content, Using, Color, 
Texture and Shape. 
Conference:SPIE. Year:1993. 

Co-Author(s):Carlton W. Niblack, Dragutin Petkovic, Peter Yanker. 
Paper: Efficient and Effective Querying by Image Content. 
Conference:J. Intell. Inf. Syst. Year:1994.
Co-Author(s):N. Roussopoulos, T. Sellis. 

Author: Michalis Faloutsos 
Paper: On Power-law Relationships of the Internet Topology.

Conference:SIGCOMM. Year:1999. 

Co-Author(s):Christos Faloutsos, Petros Faloutsos. 
Paper: QoSMIC: Quality of Service Sensitive Multicast Internet Protocol. 

Conference:SIGCOMM. Year:1998. 

Co-Author(s):Anindo Banerjea, Rajesh Pankaj. 
Paper: Aggregated Multicast with Inter-Group Tree Sharing. 

Conference:Networked Group Communication. Year:2001. 

Co-Author(s):Aiguo Fei. 

Author: Petros Faloutsos 
Paper: On Power-law Relationships of the Internet Topology.

Conference:SIGCOMM. Year:1999. 

Co-Author(s):Christos Faloutsos, Michalis Faloutsos. 
Paper: Composable controllers for physics-based character animation. 

Conference:SIGGRAPH. Year:2001. 

Co-Author(s):Michiel van de Panne, Demetri Terzopoulos. 
Paper: The virtual stuntman: dynamic characters with a repertoire of 
autonomous motor skills. 

Conference:Computers & Graphics 25. Year:2001. 

Co-Author(s):Michiel van de Panne, Demetri Terzopoulos. 



From Example 4, we can observe that some of the OSs may be 
very large in size; e.g. Christos Faloutsos has co-authored many 
papers and his OS consists of 1,309 tuples. This is not only
unfriendly to users that prefer a quick glance first before deciding
which Faloutsos they are really interested in, but also expensive
to produce. Therefore, a partial OS of size l, composed of only l
representative and important tuples, may be more appropriate.

In this paper, we investigate in detail the effective and efficient
generation of size-l OSs. Example 5 illustrates Q1 with l=15 on
the DBLP database; namely a set of size-15 OSs composed of only
15 important tuples for each DS. From the user's perspective, the
semantics of this paradigm resemble a W-KwS more than an
R-KwS. For instance, the complete OS of Example 4 resembles a
web page (as they both include comprehensive information about a
DS), whereas the size-l OSs of Example 5 resemble the snippets of
Example 1. Therefore, users with W-KwS experience will
potentially find it friendlier and also closer to their expectations.

OSs and size-l OSs can have many applications. For example,
OSs can automate responses to data protection act (DPA) subject
access requests (e.g. the US Privacy Act of 1974, UK DPA of 1984
and 1998 [1] etc.). Under DPA access requests, DSs have
the right to request access from any organization to personal
information about them. Thus, data controllers of organizations must
extract data for a given DS from their databases and present it in
an intelligible form [10]. Another application is for intelligence
services searching for information about suspects in various databases.
Hence, size-l OSs can also be very useful as they enhance the
usability of OSs. In general, a size-l OS is a concise summary of
the context around any pivot database tuple, finding application in
(interactive) data exploration, schema extraction, etc.

We should effectively generate a stand-alone size-l OS,
composed of l important tuples only, so that the user can comprehend it
without any additional information. A stand-alone size-l OS should
preserve meaningful and self-descriptive semantics about the DS.
As we explain in Section 3, for this reason, the l tuples should form
a connected graph that includes the root of the OS (i.e. t^DS). To
distinguish the importance of individual tuples t_i to be included in
the size-l OS, a local importance score (denoted as Im(OS, t_i))
is defined by combining the tuple's global importance score in the
database (denoted as Im(t_i)) and its affinity [8] in the OS (denoted
as Af(t_i)). Based on the local importance scores of the tuples of
an OS, we can find the partial OS of size l with the maximum
importance score, which includes tuples that are connected with t^DS.

The efficient generation of size-l OSs is a challenging problem.
A brute force approach, which considers all candidate size-l OSs
before finding the one with the maximum importance, requires
exponential time. We propose an optimal algorithm based on dynamic
programming, which is efficient for small problems; however, it
does not scale well with the OS size and l. In view of this, we
design three practical greedy algorithms.

We provide an extensive experimental study on DBLP and TPC-H
databases, which includes comparisons of our algorithms and
verifies their efficiency. To verify the effectiveness of our
framework, we collected user feedback, e.g. by asking several DBLP
authors (i.e. the DSs themselves) to assess the size-l OSs
computed for them on the DBLP database. The users suggested that the
results produced by our method are very close to their expectations.

The rest of the paper is structured as follows. Section 2 describes 
background and related work. Section 3 describes the semantics 
of size-l OS keyword queries and formulates the problem of their
generation. Sections 4 and 5 introduce the optimal and greedy al- 
gorithms respectively. Section 6 presents experimental results and 
Section 7 provides concluding remarks. 

2. BACKGROUND AND RELATED WORK 

In this section, we first describe the concept of object summaries 
(OSs), which we build upon in this paper. We then present and 
compare other related work in R-KwS, ranking and summarization. 
To the best of our knowledge there is no previous work that focuses 
on the computation of size-l OSs.

2.1 Object Summaries 

In the context of OS search in relational databases [8, 7], a query 
is a set of keywords (e.g. "Christos Faloutsos") and the result is a 
set of OSs. An OS is generated for each tuple (t^DS) found in the
database that contains the keyword(s) as part of an attribute's value
(e.g. tuple "Christos Faloutsos" of relation Author in the DBLP
database). An OS is a tree structure composed of tuples, having t^DS
as root and t^DS's neighboring tuples (i.e. those associated through
foreign keys) as its children/descendants. 

Figure 1: The DBLP Database Schema

Figure 2: The DBLP Author G^DS (Annotated with (Affinity),
max(R_i) and mmax(R_i))

In order to construct OSs, this approach combines the use of
graphs and SQL. The rationale is that there are relations, denoted
as R^DS (e.g. the Author relation), which hold information about the
queried Data Subjects (DSs), and the relations linked around R^DSs
contain additional information about the particular DS. For each
R^DS, a Data Subject Schema Graph (G^DS) can be generated; this is
a directed labeled tree that captures a subset of the database schema
with R^DS as a root. (Figures 1 and 11 illustrate the schemata of
the DBLP and TPC-H databases whereas Figures 2 and 12
illustrate exemplary G^DSs.) Each relation in G^DS is also annotated with
useful information that we describe later, such as affinity and
importance. G^DS is a "treealization" of the schema, i.e. R^DS becomes
the root, its neighboring relations become child nodes and any
looped or many-to-many relationships are replicated. Examples of
such replications are relations PaperCitedBy, PaperCites and
Co-Author on Author G^DS and relations Partsupp, Lineitem, Parts etc.
on Customer G^DS (see G^DSs in Figures 2 and 12). (User
evaluation in [8] verified that the tree format (achieved via such
replications) significantly increases friendliness and ease of use of OSs.)
The challenge now is the selection of the relations from G^DS which
have the highest affinity with R^DS; these need to be accessed and
joined in order to create a good OS. To facilitate this task, affinity
measures of relations (denoted as Af(R_i)) in G^DS are investigated,
quantified and annotated on the G^DS. The affinity of a relation R_i
to R^DS can be calculated using the following formula:

    Af(R_i) = \sum_{j} m_j w_j \cdot Af(R_{Parent}),    (1)

where j ranges over a set of metrics (m_1, m_2, ..., m_n) and their
corresponding weights (w_1, w_2, ..., w_n), and Af(R_Parent) is the affinity
of R_i's parent to R^DS. Affinity metrics between R_i and R^DS
include (1) their distance and (2) their connectivity properties on both
the database schema and the data-graph (see [8] for more details).
Given an affinity threshold θ, a subset of G^DS can be produced,
denoted as G^DS(θ). Finally, by traversing G^DS(θ) (e.g. by
joining the corresponding relations) we can generate the OSs (either
by using the precomputed data-graph or directly from the database
using Algorithm 5). More precisely, a breadth-first traversal of the
corresponding G^DS(θ) with the t^DS tuple as the initial root entry
of the OS tree is applied. For instance, for keyword query Q1,
Author G^DS of Figure 2 and θ=0.7, the report presented in
Example 4 will automatically be generated. Note that Author G^DS(0.7)
includes all relations whilst Customer G^DS(0.7) includes only
Customer, Nation, Region, Order, Lineitem and Partsupp relations
(since all these relations have affinity greater than 0.7). Similarly, the
set of attributes A_j from each relation R_j that are included in a G^DS
are selected by employing an attribute affinity and a threshold (i.e.
θ'). For example, in a Customer OS, Comment is excluded from
the Partsupp relation as it is not relevant to Customer DSs.
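
To make Equation 1 and the threshold θ concrete, the following is a minimal Python sketch, not the implementation of [8]. It assumes that the root relation R^DS has affinity 1, that each relation carries a list of (metric, weight) pairs whose weighted sum lies in [0, 1], and that a relation is kept when its affinity is at least θ. The class Relation, the helper names and all metric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Relation:
    name: str  # unique label of the node in the treealized G^DS
    metrics: List[Tuple[float, float]]  # hypothetical (m_j, w_j) pairs
    children: List["Relation"] = field(default_factory=list)


def annotate_affinity(root: Relation) -> Dict[str, float]:
    """Apply Af(R_i) = sum_j(m_j * w_j) * Af(R_Parent) top-down."""
    affinity = {root.name: 1.0}  # assumption: Af(R^DS) = 1
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in parent.children:
            weighted = sum(m * w for m, w in child.metrics)
            affinity[child.name] = weighted * affinity[parent.name]
            stack.append(child)
    return affinity


def select_relations(affinity: Dict[str, float], theta: float) -> Set[str]:
    """Relations of G^DS(theta); the inclusive comparison is an assumption."""
    return {name for name, af in affinity.items() if af >= theta}


# Hypothetical three-relation G^DS: Author -> Paper -> Conference.
conference = Relation("Conference", [(0.8, 1.0)])
paper = Relation("Paper", [(0.9, 0.5), (0.94, 0.5)], [conference])
author = Relation("Author", [], [paper])

affinity = annotate_affinity(author)
kept = select_relations(affinity, theta=0.75)
```

With these made-up metrics, Paper receives affinity 0.92 and Conference 0.736, so a threshold of 0.75 keeps only Author and Paper.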

2.2 R-KwS and Ranking 

R-KwS techniques facilitate the discovery of joining tuples (i.e. 
Minimal Total Join Networks of Tuples (MTJNTs) [13]) that col- 
lectively contain all query keywords and are associated through 
their keys; for this purpose the concept of candidate networks is in- 
troduced; see, for example, DISCOVER [13], BANKS [2, 4]. The 
OSs paradigm differs from other R-KwS techniques semantically, 
since it does not focus on finding and ranking candidate networks 
that connect the given keywords, but searches for OSs, which are 
trees centered around the data subject described by the keywords. 

Precis queries [15, 19] resemble size-l OSs as they append
additional information to the nodes containing the keywords, by
considering neighboring relations that are implicitly related to the
keywords. More precisely, a precis query result is a logical subset of
the original database (i.e. a subset of relations and a subset of
tuples). For instance, the precis of Q1 is a subset of the database
that includes the tuples of the three Faloutsos Authors and a subset
of their (common) Papers, Co-Authors, Conferences, etc. In
contrast, our result is a set of three separate size-l OSs (Example 5). A
thorough comparative evaluation of OSs and precis appears in our
earlier work [8].

R-KwS techniques also investigate the ranking of their results. 
Such ranking paradigms consider: 

1) IR-style techniques, which weight the number of times keywords
(terms) appear in MTJNTs [12, 16, 17, 23]. However, such
techniques miss tuples that are related to the keywords but do
not contain them [3]; e.g. for Q1, tuples in relation Papers also have
importance although they do not include the Faloutsos keyword.

2) Tuples' importance, which weights the authority flow through
relationships, e.g. ObjectRank [3], [22], ValueRank [9], PageRank
[5], BANKS (PageRank inspired) [2], [4], XRANK [11] etc. In this
paper we use tuples' importance to model global importance scores
and more precisely global ObjectRank (for DBLP) and ValueRank
(for TPC-H). (Note that our algorithms are orthogonal to how tuple
importance is defined and other methods could also be
investigated.) ObjectRank [3] is an extension of PageRank on databases and
introduces the concept of Authority Transfer Rates between the
tuples of each relation of the database (Authority Transfer Rates are
annotated on the so-called Authority Transfer Schema Graph,
denoted as G^A, e.g. Figure 13). They are based on the observation
that solely mapping a relational database to a graph (as in the case
of the web) is not accurate and a G^A is required to control the flow
of authority in neighboring tuples. For instance, well cited papers
should have higher importance than papers citing many other
papers, or a well cited paper should have better ranking than another
one with fewer citations. ValueRank is an extension of ObjectRank
which also considers the tuples' values and thus can be applied
on any database (e.g. TPC-H), in contrast to ObjectRank which is
mainly effective on authoritative flow data such as bibliographic
data (e.g. DBLP). For instance, in trading databases, a customer with
five orders of value $10 may get lower importance than another
customer with three orders of value $100.

2.3 Other Related Work 

Document summarization techniques have attracted significant 
research interest [20, 21]. In general, these techniques are IR-style 
inspired. Web snippets [21] are examples of document summaries 
that accompany search results of W-KwSs in order to facilitate their 
quick preview (e.g. see Example 1). They can be either static (e.g.
composed of the first words of the document or description
metadata) or query-biased (e.g. composed of sentences in which the
keywords appear many times) [20]. Still, the direct application of such
techniques on databases in general and OSs in particular is ineffective;
e.g. they disregard the relational associations and semantics of the
displayed tuples. For example, consider Q1 and Example 4:
papers authored by Faloutsos (although they do not include the Faloutsos
keyword) have importance analogous to their citations and authors;
this is ignored by document summarization techniques.

XML keyword search techniques, similarly to R-KwSs, facil- 
itate the discovery of XML sub-trees that contain all query key- 
words (e.g. "Faloutsos"+"Agrawal"). Analogously, XML snippets 
[14] are sub-trees of the complete XML result, with a given size, 
that contain all keywords. An apparent difference between size-l
OSs and XML snippets is their semantics, which is analogous to
the semantic difference between complete OSs and XML results.
Therefore, their generation is a completely different problem. An
interesting similarity is that both size-l OSs and XML snippets are
sub-trees of the corresponding complete results, hence composed
of connected nodes. This common property exists for the same reason,
i.e. to preserve self-descriptiveness.

3. SIZE-l OS

A size-l OS keyword query consists of (1) a set of keywords and
(2) a value for l (e.g. Q1 and l=15) and the result comprises a
set of size-l OSs. A good size-l OS should be a stand-alone and
meaningful synopsis of the most important information about the
particular DS.

DEFINITION 1. Given an OS and an integer size l, a candidate
size-l OS is any subset of the OS composed of l tuples, such that all
l tuples are connected with t^DS (i.e. the root of the OS tree).

Definition 1 guarantees that the size-l OS remains stand-alone
(so users can understand it as it is, without any additional tuples);
i.e. by including connecting tuples we also include the semantics
of their connection to the DS. (Recall that this criterion was also
used in [14] for the same reasons.) Consider the example of Figure
3, which is a fraction of the Faloutsos OS (in the DBLP database).
Even if the Paper "Efficient and Effective Querying by Image
Content" has less local importance (e.g. 20) than the Co-Author(s)
Sellis (e.g. 43) and Roussopoulos (e.g. 34), we cannot exclude the
Paper and include only the Co-Authors. The rationale is that by
excluding the Paper tuple we also exclude the semantic association
between the Author and Co-Author(s), which in this case is their
common paper. Also note that a size-l OS will not necessarily
include the l tuples with the largest importance scores. For example,
the Co-Author Roussopoulos, although with larger importance than
the particular Paper, may have to be excluded from a size-l OS (e.g.
from a size-3 OS which will consist of (1) Author "Faloutsos", (2)
Paper "Efficient ..." and (3) Co-Author "Sellis").
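
The connectivity requirement of Definition 1 can be checked mechanically. The sketch below is illustrative only: it assumes the OS tree is given as a child-to-parent map (the root maps to None), and the function name and tuple labels are hypothetical. In a tree, a non-empty set of tuples is connected to the root exactly when the parent of every selected non-root tuple is also selected.

```python
from typing import Dict, Optional, Set


def is_candidate_size_l(parent: Dict[str, Optional[str]],
                        selected: Set[str], l: int) -> bool:
    """Check Definition 1: exactly l tuples, all connected to the root."""
    if l < 1 or len(selected) != l:
        return False
    # Every selected tuple is either the root or has its parent selected,
    # so following parent links from any selected tuple reaches the root.
    return all(parent[t] is None or parent[t] in selected for t in selected)


# Fragment of the Faloutsos OS discussed above (labels are illustrative).
parent = {
    "Author: Faloutsos": None,
    "Paper: Efficient ...": "Author: Faloutsos",
    "Co-Author: Sellis": "Paper: Efficient ...",
    "Co-Author: Roussopoulos": "Paper: Efficient ...",
}

with_paper = is_candidate_size_l(
    parent,
    {"Author: Faloutsos", "Paper: Efficient ...", "Co-Author: Sellis"}, 3)
without_paper = is_candidate_size_l(
    parent,
    {"Author: Faloutsos", "Co-Author: Sellis", "Co-Author: Roussopoulos"}, 3)
```

Here with_paper is True, while without_paper is False because both Co-Authors would lose their connecting Paper tuple.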

Given an OS, we can extract exponentially many size-l OSs that
satisfy Definition 1. In the next section we define a measure for the
importance (i.e., quality) of a candidate size-l OS. Our goal then
is to retrieve a size-l OS of the maximum possible quality.

3.1 Importance of a Size-l OS

The (global) importance of any candidate size-l OS S, denoted
as Im(S), is defined as:

    Im(S) = \sum_{t_i \in S} Im(OS, t_i),    (2)

where Im(OS, t_i) is the local importance of tuple t_i (to be
defined in Section 3.2 below). We say that a candidate size-l OS
is an optimal size-l OS if it has the maximum Im(S) (denoted
as max(Im(S))) over all candidate size-l OSs for the given OS.
Wherever an optimal size-l OS is hard to find, we target the retrieval
of a sub-optimal size-l OS of the highest possible importance.

Figure 3: A Fraction of the Faloutsos OS (Annotated with Local
Importance)

3.2 Local Importance of a Tuple (Im(OS, t_i))

The local importance Im(OS, t_i) of each tuple t_i in an OS
can be calculated by:

    Im(OS, t_i) = Im(t_i) \cdot Af(t_i),    (3)

where Im(t_i) is the global importance of t_i in the database. We use
global ObjectRank and ValueRank to calculate global importance,
as discussed in Section 2.2. Af(t_i) is the affinity of t_i to the t^DS;
namely the affinity Af(R_i) of the corresponding relation R_i, to which
t_i belongs, to R^DS. This can be calculated from G^DS using Equation
1, as discussed in Section 2.1 (alternatively, a domain expert can set
Af(R_i)s manually). For example, if tuple t_i is paper "Efficient ..."
with Im(t_i)=21.74 and Af(t_i)=Af(R_Paper)=0.92 (see the
affinity on Author G^DS in Figure 2), then Im(OS, t_i)=21.74*0.92=20.

Multiplying global importance Im(t_i) with affinity Af(t_i)
reduces the importance of tuples that are not closely related to the
DS. For instance, although paper "Efficient ..." and year "1988"
have nearly equal global importance scores (21.74 and 21.64,
respectively), their local importance scores become 20 (=21.74*0.92) and
18 (=21.64*0.83) respectively. The use of importance and
affinity metrics is inspired by earlier work; e.g. XRANK and
precis employ variations of importance and affinity [11, 15]. For
defining affinity in [11, 15], only distance is considered; however,
as shown in [8], distance is only one among the possible affinity
metrics (e.g. cardinality, reverse cardinality etc.).
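
As a small worked illustration of Equations 2 and 3, the Python sketch below recomputes the two local importance scores quoted above and sums the scores of one candidate size-3 OS. The function names are hypothetical; the scores 20, 43 and 34 are the illustrative values used earlier for the Paper, Sellis and Roussopoulos tuples, while the root's score of 50 is an assumption made up for this example.

```python
from typing import Dict, Iterable


def local_importance(global_importance: float, affinity: float) -> float:
    """Equation 3: Im(OS, t_i) = Im(t_i) * Af(t_i)."""
    return global_importance * affinity


def summary_importance(local_scores: Dict[str, float],
                       selected: Iterable[str]) -> float:
    """Equation 2: Im(S) is the sum of local scores of the tuples in S."""
    return sum(local_scores[t] for t in selected)


paper_score = local_importance(21.74, 0.92)  # roughly 20
year_score = local_importance(21.64, 0.83)   # roughly 18

local_scores = {
    "Author": 50.0,        # assumed value, not given in the text
    "Paper": 20.0,
    "Sellis": 43.0,
    "Roussopoulos": 34.0,
}
size3_importance = summary_importance(
    local_scores, ["Author", "Paper", "Sellis"])
```

Under these numbers the candidate {Author, Paper, Sellis} has importance 113, even though Roussopoulos alone scores higher than the Paper tuple it depends on.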

3.3 Problem Definition 

The generation of a complete OS is straightforward: we only
have to traverse the corresponding G^DS (see Algorithm 5 in the
Appendix). The generation of a size-l OS is a more challenging
task because we need to select l tuples that are connected to the t^DS
of the tree and at the same time result in the largest Im(S). Hence,
the problem we study in this paper can be defined as follows:

Problem 1 (Find an optimal size-l OS). Given a t^DS, the
corresponding G^DS and l, find a size-l OS S of maximum Im(S).

A direct approach for solving this problem is to first generate the
complete OS (i.e. Algorithm 5)¹ and then determine the optimal
size-l OS from it. In Section 4, we propose a dynamic programming
(DP) algorithm for this purpose. If the complete OS is too
large, solving the problem exactly using DP can be too expensive.
In view of this, in Section 5, we propose greedy algorithms that
find a sub-optimal synopsis. In order to further reduce the cost of
finding a sub-optimal solution, in Section 5.3, we also propose an
economical approach, which, instead of the complete OS, initially
generates a preliminary partial OS, denoted as prelim-l OS. The
rationale of a prelim-l OS is to avoid the extraction and subsequent
processing of fruitless tuples that are unlikely to make
it into the size-l OS. DP and the greedy algorithms can be applied on
the prelim-l OS to find a good sub-optimal size-l OS.

¹ In fact, any tuples or subtrees which have distance at least l from
the root t^DS are excluded from the OS, as these cannot be part of a
connected size-l OS rooted at t^DS.



4. THE DP ALGORITHM 

This section describes a dynamic programming (DP) algorithm,
which, given an OS, determines the optimal size-l OS in it. The OS
is a tree, as discussed in Section 2. Every node v of the OS tree is
a tuple t_i, and carries a weight w(v), which is the local importance
Im(OS, t_i) of the corresponding tuple t_i. Given the tree OS, our
objective is to find a subtree S_opt, such that (i) S_opt includes the
root node t^DS of OS, (ii) the tree has l nodes, and (iii) its nodes
have the maximum sum of weights among all trees that satisfy (i) and
(ii). In the third condition, the sum of node weights corresponds
to Im(S_opt), according to Equation 2. Since this is the maximum
among all qualifying subtrees, S_opt is the optimal size-l OS.

Assume that the root t^DS in S_opt has a child v and that the subtree
of S_opt rooted at v has i nodes. Then, this subtree should be the optimal
size-i OS rooted at v. DP operates based on exactly this assertion; for
each candidate node v to be included in the optimal synopsis and
for each number of nodes i in the subtree of v that can be included,
we compute the corresponding optimal size-i synopsis and the
corresponding sum of weights. The optimal size-i synopsis rooted at v
is computed recursively from precomputed size-j synopses (j < i)
rooted at v's children; to find it, we should consider all synopses
formed by v and all size-(i − 1) combinations of its children and
subtrees rooted at them.

Specifically, let d(v) be the depth of a node v in OS (the root
t^DS has depth 0). The subtree rooted at v can contribute at most
l − d(v) nodes to the optimal solution, because in every solution that
includes v, the complete path from the root to v must be included
(due to the fact that t^DS should be included and the solution must
be connected). The DP algorithm computes, for each node v of OS,
S_{v,i}: the optimal size-i OS in the subtree rooted at v, for all
i ∈ [1, l − d(v)]. In addition to S_{v,i}, the algorithm
should track W(S_{v,i}), the sum of weights of all nodes in S_{v,i}.

DP (Algorithm 1) proceeds in a bottom-up fashion; it starts from
nodes in OS at depth l − 1; these nodes can only contribute
themselves to the optimal solution (nodes at depth at least l cannot
participate in a size-l OS). For each such node v, trivially
W(S_{v,1}) = w(v). Now consider a node v at depth k < l − 1. Upon
reaching v, for all children u of v, quantities S_{u,i} and W(S_{u,i})
have been computed for all i ∈ [1, l − d(v) − 1]. Let us now see
how we can compute S_{v,i} for each i ∈ [1, l − d(v)]. First, each
S_{v,i} should include v itself. Then, we examine all possible
combinations of v's children and number of nodes to be selected from
their subtrees, such that the total number of selected nodes is i − 1.
We do not have to check the subtrees of u's children, since for each
number of nodes j to be selected from a subtree rooted at child u,
we already have the optimal set S_{u,j} and the corresponding sum of
weights W(S_{u,j}). Note that when we reach the OS root r = t^DS,
we only have to compute S_{r,l}: the optimal size-l OS (i.e., there is
no need to compute S_{r,i} for i ∈ [1, l − 1]).



Algorithm 1 The Optimal Size-l OS (DP) Algorithm

DP(l, t^DS, G^DS)
1: OS Generation(t^DS, G^DS)    ▷ generates the complete OS, annotates
                                  each node with local importance
2: for each node v at depth l − 1 do set S_{v,1} = {v}
3: for each depth k = l − 2 down to 0 do
4:   for each node v at depth k do
5:     for i = 1 to l − d(v) do
6:       S_{v,i} = {v} ∪ the best combination of v's children and nodes
         from them such that the total number of nodes is i − 1
7: return S_{r,l}
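
The recurrence behind Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch and not the authors' implementation: it assumes the OS is already available as a children map with precomputed local importance weights, and it realizes step 6 by merging the children of v one at a time in a knapsack-style fashion, which is one possible way of finding "the best combination" and is an assumption of this example. The function name and the individual node weights are hypothetical; the weights are chosen so that nodes {10, 11} sum to 43 and nodes {11, 13} sum to 90.

```python
from typing import Dict, FrozenSet, Hashable, List, Optional, Tuple

Entry = Optional[Tuple[float, FrozenSet[Hashable]]]


def optimal_size_l(children: Dict[Hashable, List[Hashable]],
                   weight: Dict[Hashable, float],
                   root: Hashable, l: int) -> Entry:
    """Return (W(S_{root,l}), S_{root,l}), or None if no size-l OS exists."""

    def solve(v: Hashable, cap: int) -> List[Entry]:
        # best[i] holds (W(S_{v,i}), S_{v,i}) for 1 <= i <= cap, else None.
        best: List[Entry] = [None] * (cap + 1)
        best[1] = (weight[v], frozenset([v]))
        if cap > 1:
            for u in children.get(v, []):
                sub = solve(u, cap - 1)  # child may use at most cap - 1 nodes
                merged = list(best)
                for i in range(1, cap):
                    if best[i] is None:
                        continue
                    for j in range(1, cap - i + 1):
                        if sub[j] is None:
                            continue
                        total = best[i][0] + sub[j][0]
                        if merged[i + j] is None or total > merged[i + j][0]:
                            merged[i + j] = (total, best[i][1] | sub[j][1])
                best = merged
        return best

    if l < 1:
        return None
    return solve(root, l)[l]


# Hypothetical subtree rooted at node 4 with children 10 and 11 (11 has child 13).
children = {4: [10, 11], 11: [13]}
weight = {4: 5.0, 10: 13.0, 11: 30.0, 13: 60.0}
best_size3 = optimal_size_l(children, weight, root=4, l=3)
```

For this small tree the sketch selects {4, 11, 13} with total weight 95, preferring the chain through node 11 over taking both children of node 4.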




Depth | Computed Sets
------+-----------------------------------------------------------------
3     | S_{13,1}=13, S_{14,1}=14
2     | S_{7,1}=7, S_{8,1}=8, S_{9,1}=9, S_{10,1}=10, S_{11,1}=11,
      | S_{11,2}={11,13}, S_{12,1}=12, S_{12,2}={12,14}
1     | S_{2,1}=2, S_{3,1}=3, S_{3,2}={3,8}, S_{3,3}={3,7,8}, S_{4,1}=4,
      | S_{4,2}={4,11}, S_{4,3}={4,11,13}, S_{5,1}=5, S_{6,1}=6,
      | S_{6,2}={6,12}, S_{6,3}={6,12,14}
0     | S_{1,4}={1,4,5,6}

Figure 4: Example: Steps of DP



As an example, consider the OS shown in Figure 4 (top) and
assume that we want to compute the optimal size-4 OS from it. The
table shows the steps of DP in computing the optimal sets S_{v,i} in
a bottom-up fashion, starting from nodes 13 and 14 which are at
depth 3 (i.e. l − 1). For example, to compute S_{4,3}={4,11,13}, we
compare the two possible cases S_{4,3} = {4} ∪ S_{10,1} ∪ S_{11,1} and
S_{4,3} = {4} ∪ S_{11,2}, since S_{10,1} ∪ S_{11,1} and S_{11,2} are the only
combinations of sets from node 4's children that total 2 nodes (i − 1 = 2).
S_{10,1} ∪ S_{11,1}={10,11} has total weight 43 and S_{11,2}={11,13}
has total weight 90. Thus, S_{4,3} = {4} ∪ S_{11,2} = {4,11,13}. Note
that for nodes that do not have enough children, the number of sets
that are computed could be smaller than indicated in the
pseudocode. For example, for node 2, we only have S_{2,1}; i.e. S_{2,2} and
S_{2,3} do not exist although the node is at depth 1, because node 2 has
no children. In addition, for the root node, DP only has to compute
S_{1,4}, since we only care about the optimal size-l summary (there
are no nodes above the root that could use smaller summaries).

In terms of complexity, for each node v in the OS up to depth
l − 1, we need to compute up to l − d(v) sets. For each set we need
to find the optimal combination of children and nodes from them
to choose. The cost of choosing the best combination increases
exponentially with i, which is O(l). Thus, the overall cost of DP
is O(n^l) for an input OS of size n, as can be verified in our
experiments. This is essentially the complexity of the problem as DP
explores all possible summaries systematically and, in the general 
case, there is no way to prune the search space. For large values 
of /, DP becomes impractical and we resort to the greedy heuristics 
described in the next section. Finally, the following lemma proves 
the optimality of DP. 

LEMMA 1. Algorithm 1 computes the optimal size-l OS.

PROOF. The optimal size-l OS $S_{opt}$ includes the root $t^{DS}$ of the OS and a set of subtrees rooted at some of $t^{DS}$'s children. DP tests all possible combinations of children and numbers of nodes from the corresponding subtrees; therefore, the combination that corresponds to $S_{opt}$ will be considered. For this specific combination, for each child $v$ and number of nodes $i$, the optimal subtree rooted at $v$ with $i$ nodes (i.e., $S_{v,i}$) has already been found during the bottom-up computation process of DP. Therefore, DP will select and output the optimal combination (which has the largest importance among all tested ones). □



Algorithm 2 The Bottom-Up Pruning Size-l Algorithm

Bottom-Up Pruning Size-l ($l$, $t^{DS}$, $G^{DS}$)
1: OS Generation($t^{DS}$, $G^{DS}$)  ▷ generates initial size-l (i.e. complete or prelim-l) OS and initial PQ
2: while (|size-l OS| > $l$) do
3:    $t_{tem}$ = deQueue(PQ)  ▷ the smallest value from PQ
4:    if !(hasSiblings(size-l OS, $t_{tem}$)) then
5:       enQueue(PQ, parent(size-l OS, $t_{tem}$))  ▷ check whether after pruning $t_{tem}$, its parent becomes a leaf node
6:    prune $t_{tem}$ from size-l OS
7: return size-l OS



5. GREEDY ALGORITHMS 

Since the DP algorithm does not scale well, in this section we investigate greedy heuristics that aim at producing a high-quality size-l OS, though not necessarily the optimal one. A property that the algorithms exploit is that the local importance of tuples in the OS (i.e. $Im(OS, t_i)$) usually decreases with the node depth from the root $t^{DS}$ of the OS. Recall that $Im(OS, t_i)$ is the product $Im(t_i) \cdot Af(t_i)$, where $Im(t_i)$ is the global importance of tuple $t_i$ and $Af(t_i)$ is the affinity of the relation that $t_i$ belongs to. $Af(t_i)$ monotonically decreases with the depth of the tuple, since $Af(R_i)$ is a product of its parent's affinity and $Af(R_i) \le 1$ (cf. Equation 1). On the other hand, the global importance of a particular tuple is to some extent unpredictable. Therefore, even though the local importance is not monotonically decreasing with the depth of the tuple on the OS tree, it is more likely to decrease than to increase with depth. Hence, tuples higher on the OS are more likely to have greater local importance than lower tuples. Moreover, note that due to the non-monotonicity of OSs, existing top-k techniques such as [6, 12, 17] cannot be applied.

5.1 Bottom-Up Pruning Size-l Algorithm

This algorithm, given an initial OS (either a complete or a prelim-l OS), iteratively prunes from the bottom of the tree the $n-l$ leaf nodes with the smallest $Im(OS, t_i)$, where $n$ is the number of nodes in the complete OS. The rationale is that, since tuples need to be connected with the root and lower tuples on the tree are expected to have lower importance, we can start pruning from the bottom. A priority queue (PQ) organizes the current leaf nodes according to their local importance. Algorithm 2 shows a pseudocode of the algorithm and Figure 5 illustrates the steps.

More precisely, this algorithm first generates the initial OS (line 1; e.g. the complete OS using Algorithm 5). The OS Generation algorithm generates the initial size-l OS and also the initial PQ (initially holding all leaves of the given OS). Then, the algorithm iteratively prunes the leaves with the smallest $Im(OS, t_i)$. Whenever a new leaf is created (e.g., after pruning node 9 in Figure 5, node 3 becomes a leaf), it is added to PQ. The algorithm terminates when only $l$ nodes remain in the tree. The tree is then returned as the size-l OS. In terms of time complexity, the algorithm performs $O(n)$ delete operations in constant time, each potentially followed by an update to the PQ. Since there are $O(n)$ elements in PQ, the cost of each update operation is $O(\log n)$. Thus, the overall cost of the algorithm is $O(n \log n)$. This is much lower than the complexity of the DP algorithm, which gives the optimal solution.
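The pruning loop can be illustrated with a binary heap standing in for PQ. The following is a minimal sketch, assuming the OS is given as a hypothetical `children` map and a `weight` map of local importance scores; the function name and the toy data are invented for the example. Instead of the `hasSiblings` test of Algorithm 2, the sketch keeps a counter of remaining children per node, which answers the same question (has the parent just become a leaf?).

```python
import heapq


def bottom_up_prune(children, weight, root, l):
    """Greedy size-l OS: repeatedly prune the leaf with the smallest local
    importance until only l nodes remain."""
    parent = {c: p for p, cs in children.items() for c in cs}
    remaining = {v: len(children.get(v, [])) for v in weight}  # children still in the OS
    kept = set(weight)
    # PQ initially holds every leaf of the OS (the root is never pruned).
    pq = [(weight[v], v) for v in weight if remaining[v] == 0 and v != root]
    heapq.heapify(pq)
    while len(kept) > l and pq:
        _, v = heapq.heappop(pq)          # current leaf with the smallest importance
        kept.remove(v)
        p = parent[v]
        remaining[p] -= 1
        if remaining[p] == 0 and p != root:
            heapq.heappush(pq, (weight[p], p))   # parent has just become a leaf
    return kept


# Hypothetical toy OS: node 1 is the root, node 4 is a child of node 3.
children = {1: [2, 3], 3: [4]}
weight = {1: 10, 2: 5, 3: 1, 4: 20}
print(bottom_up_prune(children, weight, 1, 3))   # {1, 3, 4}
print(bottom_up_prune(children, weight, 1, 2))   # {1, 3}
```

For size 2 the sketch returns {1, 3} with total importance 11, whereas {1, 2} would score 15; this small case shows how pruning the cheapest leaf first can miss the optimum when importance is not monotonic along a path.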

On the other hand, this method will not always return the optimal solution; e.g. the optimal size-5 OS should include nodes 1, 5, 6, 12 and 14 instead of 1, 5, 6, 11 and 13 (Figure 5(d)). In practice, it is very accurate (see our experimental results in Section 6.2), due to the aforementioned property of $Im(OS, t_i)$, which gives nodes closer to the root a higher probability of having a high local importance. Lemma 2 proves an optimality condition for this algorithm (Paper OSs in the DBLP database are an example of this condition, as discussed in Section 6.2).

(a) The initial OS   (b) First leaf pruned out   (c) The size-10 OS   (d) The size-5 OS

Figure 5: The Bottom-Up Pruning Size-l Algorithm: Size-l OSs and their Corresponding PQs (annotated with tuple ID and local importance)

LEMMA 2. When the nodes of an OS have monotonically decreasing local importance scores with respect to their distance from the root (i.e. the score of each parent is not smaller than that of its children), then the Bottom-Up Pruning Size-l Algorithm returns the optimal size-l OS.

PROOF. PQ.top always holds the node with the current smallest score in the OS. This is because PQ.top is by definition the smallest among leaf nodes, and under the stated condition leaf nodes never have larger scores than their ancestors. Therefore, by removing the $n-l$ current smallest values (iteratively stored in PQ.top) from an OS, we get the optimal size-l OS. □

5.2 Update Top-Path-l Algorithm

We now explore a second greedy heuristic. This algorithm iteratively selects the path $p_i$ of tuples with the largest average importance per tuple (denoted as $AI(p_i)$), adds $p_i$ to the size-l OS, removes the nodes of $p_i$ from the OS, and updates $AI(p_i)$ for the remaining paths accordingly. The rationale for selecting the path of tuples (instead of the single tuple) with the current largest importance is that, since all nodes need to be connected and monotonicity may not hold, we facilitate the selection of nodes of large importance even though their ancestors may have lower importance. Algorithm 3 is a pseudocode of the heuristic and Figure 6 illustrates an example.

Algorithm 3 The Update Top-Path-l Algorithm

Update Top-Path-l ($l$, $t^{DS}$, $G^{DS}$)
1: OS Generation($t^{DS}$, $G^{DS}$)  ▷ generates initial size-l (i.e. complete or prelim-l) OS, annotates tuples with $AI(p_i)$
2: while (|size-l OS| < $l$) do
3:    $p_i$ = path with max $AI(p_i)$
4:    add first $l$ - |size-l OS| nodes of $p_i$ to size-l OS
5:    if (|size-l OS| < $l$) then
6:       remove selected path $p_i$ from the tree
7:       for each child $v$ of nodes in $p_i$ do
8:          update $AI(p_i)$ for each node $t_j$ in the subtree rooted at $v$
9: return size-l OS

(a) The initial OS   (b) First update   (c) Second update   (d) Final update (size-5 OS)

Figure 6: The Update Top-Path-l Algorithm: The size-5 OS (annotated with tuple ID, local importance and $AI(p_i)$; selected nodes are shaded)

More precisely, this algorithm (like the Bottom-Up Pruning Algorithm) first generates the complete (or alternatively the prelim-l) OS. During the OS generation, for each tuple $t_i$, we also calculate the average importance per tuple $AI(p_i)$ for the corresponding path $p_i$ from the root to $t_i$. We then select the node with the largest $AI(p_i)$ and add the corresponding path to the size-l OS. By removing the nodes of $p_i$ from the OS, the tree becomes a forest; each child of a node in $p_i$ is the root of a tree. Accordingly, the $AI(p_i)$ for each node $t_i$ is updated to disregard the removed nodes of the path selected at the previous step. The process of selecting the path with the highest $AI(p_i)$ and adding it to the size-l OS is repeated as long as fewer than $l$ nodes have been selected so far. If fewer than $|p_i|$ nodes are needed to complete the size-l OS, then only the top nodes of the path are added to the size-l OS (because only these nodes are connected to the current size-l OS).

Consider the example shown in Figure 6. Node 5 has $AI(p_i)$=55, because its path includes nodes 1 and 5 with average $Im(OS, t_i)$ being (30+80)/2=55. Assuming $l$=5, in the first loop the algorithm selects the path of nodes 1 and 5, which has the largest $AI(p_i)$, i.e. 55. Then, the nodes along the path (nodes 1 and 5) are added to the size-5 OS. For the remaining nodes, $AI(p_i)$ is updated to disregard the removed nodes (see the top-right tree in Figure 6). For example, the new $AI(p_i)$ for node 10 is 22, because its path now includes only nodes 4 and 10 with average $Im(OS, t_i)$ being 22. The next path to be selected is the one ending at node 13, which adds two more nodes to the snippet. Finally, node 6 is added to complete the size-5 OS.

The complexity of the algorithm can be as high as $O(nl)$, where $n$ is the size of the complete OS, as at each step the algorithm may choose only one node, which causes the update of $O(n)$ paths. The algorithm can be optimized if we precompute for each node $v$ of the tree the node $s(v)$ with the highest $AI(p_i)$ in the subtree rooted at $v$. Regardless of any change at any ancestor of $v$, $s(v)$ should remain the node with the highest $AI(p_i)$ in the subtree (because the change will affect all nodes in the subtree in the same way). Thus, only a small number of comparisons is needed after each path selection to find the next path to be selected. Specifically, for each child $v$ of nodes in the currently selected path $p_i$, we need to update $AI(p_i)$ for $s(v)$ and then compare all $s(v)$'s to pick the one with the largest $AI(p_i)$. In terms of approximation quality, this algorithm does not always return the optimal solution; e.g. the size-3 OS will have nodes 1, 5 and 11 instead of 1, 5 and 6. However, empirically, this method gives better results than Bottom-Up Pruning.
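The select-and-update loop of this heuristic can be sketched as follows. This is the straightforward $O(nl)$ variant, without the $s(v)$ precomputation, and it recomputes every path average in each round instead of updating annotations in place; the function name, the `children` and `weight` maps and the toy data are assumptions made for the illustration.

```python
def update_top_path(children, weight, root, l):
    """Greedy size-l OS: repeatedly add the path with the largest average
    local importance AI(p), measured from the already selected part."""
    parent = {c: p for p, cs in children.items() for c in cs}
    selected = []            # nodes of the size-l OS, in selection order
    chosen = set()
    while len(selected) < l and len(chosen) < len(weight):
        best_path, best_ai = None, float("-inf")
        for v in weight:
            if v in chosen:
                continue
            # Walk upwards until an already selected node (or the root) is
            # reached; the removed nodes are thereby disregarded.
            path, u = [], v
            while u is not None and u not in chosen:
                path.append(u)
                u = parent.get(u)
            path.reverse()
            ai = sum(weight[x] for x in path) / len(path)
            if ai > best_ai:
                best_path, best_ai = path, ai
        # Only the top nodes of the path are taken if fewer are needed.
        for x in best_path[: l - len(selected)]:
            selected.append(x)
            chosen.add(x)
    return selected


# Hypothetical toy OS: node 1 is the root, node 4 is a child of node 3.
children = {1: [2, 3], 3: [4]}
weight = {1: 10, 2: 5, 3: 1, 4: 20}
print(update_top_path(children, weight, 1, 3))   # [1, 3, 4]
```

In the toy tree the path 1, 3, 4 has the largest average (31/3), so node 4 is reached through the low-importance node 3 in a single step.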

5.3 Top-l Prelim-l OS Preprocessing

Instead of operating on the complete OS, which may be expensive to generate and search, we propose to work on a smaller OS, which hopefully includes a good size-l OS. We denote such a preliminary partial OS as prelim-l OS (with size $j$, where $l \le j \le |OS|$). On the prelim-l OS, we can apply any of the algorithms proposed so far (of course, DP is not expected to return the optimal result, unless the prelim-l OS is guaranteed to include it). The rationale of the prelim-l OS is to avoid the extraction and processing of tuples that are unlikely to make it into the optimal size-l OS. Algorithm 4 is a pseudocode for computing the prelim-l OS, Table 1 summarizes symbols and definitions and Figure 7 illustrates an example.

Determining a prelim-l OS that includes the optimal size-l OS can be very expensive; therefore, we propose a heuristic that produces a prelim-l OS which includes at least the $l$ nodes of the complete OS with the largest local importance (denoted as the top-l set). Figure 7(a) illustrates such a prelim-l OS. Using avoidance conditions and simple statistics that summarize the range of local importance of every tuple in each relation (e.g. max($R_i$)), we can infer upper bounds for the local importance of tuples and thus safely predict whether a candidate path can potentially produce useful tuples.

DEFINITION 2. Given an OS and an integer $l$, a top-l prelim-l OS (or simply prelim-l OS) is a subset of the complete OS that includes the $l$ tuples of the OS with the largest local importance.

We annotate each relation $R_i$ on the $G^{DS}$ graph with the statistics max($R_i$) and mmax($R_i$) (see Figure 2). (Recall from Section 2.1 that we generate $G^{DS}$ graphs for every relation that may contain information about DSs.) max($R_i$) is the maximum local importance of all tuples in $R_i$, which can be derived from the maximum global importance in $R_i$ (a global statistic that is computed/updated independently of the queries) and the affinity $Af(R_i)$. mmax($R_i$) is the maximum local importance of all tuples that belong to $R_i$'s descendant relation nodes in $G^{DS}$ (i.e. $\max_j\{\max(R_j)\}$, where $j$ ranges over all such relations), or 0 if $R_i$ has no descendants (leaf node).
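As an illustration of how the mmax statistic follows from the max annotations, the sketch below computes it recursively over a $G^{DS}$ given as a tree of relation names. The function name `mmax`, the `gds_children` map and all numeric values are hypothetical; the shape of the example tree follows the Paper $G^{DS}$ mentioned in Section 6.2.

```python
def mmax(gds_children, max_li, relation):
    """Largest max(R_j) over all descendant relations R_j of `relation`
    in the G^DS tree, or 0 if `relation` is a leaf."""
    best = 0.0
    for child in gds_children.get(relation, []):
        best = max(best, max_li[child], mmax(gds_children, max_li, child))
    return best


# Hypothetical annotations; only the tree shape is taken from the text.
gds_children = {
    "Paper": ["Author", "PaperCitedBy", "PaperCites", "Year"],
    "Year": ["Conference"],
}
max_li = {"Paper": 7.0, "Author": 3.0, "PaperCitedBy": 1.0,
          "PaperCites": 1.5, "Year": 0.5, "Conference": 0.2}

print(mmax(gds_children, max_li, "Paper"))        # 3.0
print(mmax(gds_children, max_li, "Conference"))   # 0.0
```

Since both statistics depend only on the $G^{DS}$ and on per-relation maxima, they can be computed once, independently of any query.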

Table 1: Symbols and Definitions (Top-l Prelim-l OSs)

Symbol | Definition
top-l | The $l$ nodes with the largest local importance in the OS
top-l PQ | An $l$-sized priority queue with the current largest local importance of extracted tuples
largest-l | The tuple with the $l$-th largest local importance retrieved so far (i.e. the smallest value of top-l PQ), or 0 if |top-l PQ| < $l$
li($t_i$) | The local importance of tuple $t_i$ (i.e. $Im(OS, t_i)$)
R($t_i$) | The relation on $G^{DS}$ that tuple $t_i$ belongs to
$R_i(t_j)$ | The subset of $R_i$ that joins with tuple $t_j$
max($R_i$) | The maximum value of local importance of $R_i$
mmax($R_i$) | The maximum value of max($R_j$) over all of $R_i$'s descendant nodes $R_j$ on $G^{DS}$, or 0 if $R_i$ has no descendants (leaf node)
fruitless tuple | A tuple not in top-l
fruitless $G^{DS}$ relation/sub-tree | A $G^{DS}$ sub-tree starting from relation $R_i$ is considered fruitless for a given largest-l if no tuples from $R_i$ and its descendants can be fruitful for the top-l (i.e. when largest-l ≥ max($R_i$) AND largest-l ≥ mmax($R_i$))
fruitful-l relation | A relation $R_i$ is considered fruitful-l for a given largest-l if only up to $l$ nodes from the corresponding $R_i(t_j)$ can be fruitful for the top-l (i.e. when largest-l ≥ mmax($R_i$))

The algorithm for generating the prelim-l OS is an extension of the complete OS generation algorithm (e.g. Algorithm 5). The extension incorporates pruning conditions in order to avoid adding fruitless tuples and their subtrees to the prelim-l OS. More precisely, we traverse the $G^{DS}$ graph in breadth-first order. Every extracted tuple is appended to the prelim-l OS (lines 2 and 14) and to queue Q (to facilitate the breadth-first traversal of the $G^{DS}$; see lines 3 and 15). Let largest-l be the tuple with the $l$-th largest local importance retrieved so far. If the local importance of the current tuple $t_i$ is greater than largest-l, $t_i$ is added to the $l$-sized priority queue top-l PQ as well (in order to update the top-l set; lines 4 and 17). Largest-l is set to the current smallest value of top-l PQ, or to 0 if the top-l PQ does not contain $l$ values yet (lines 20-23). We traverse the $G^{DS}$ as follows. For each tuple de-queued from the queue Q (line 6), we extract all its child nodes from each corresponding child relation (lines 7-12) and we employ the following avoidance conditions:

Avoidance Condition 1 (Avoiding fruitless $G^{DS}$ sub-trees): If the top-l PQ already contains $l$ tuples and largest-l is greater than or equal to the local importance of all tuples of the current relation $R_i$ and all its descendants (i.e. largest-l ≥ max($R_i$) AND largest-l ≥ mmax($R_i$)), then there is no need to traverse the sub-tree starting at $R_i$ (line 8). In such cases, we say that the sub-tree starting from $R_i$ is fruitless. For instance, consider the example of Figure 7; while retrieving tuple y%, largest-l=0.37 and the current child relation $R_i$ is Conference with max($R_i$)=0.22 and mmax($R_i$)=0. Thus, we can safely infer that Conference has no fruitful tuples for the particular prelim-l OS. This avoidance condition does not require any I/O operations, as all the information required can be obtained cheaply from the annotated $G^{DS}$.

Avoidance Condition 2 (Limiting up to l tuple extractions from fruitful-l relations): Assume that we are about to traverse $R_i$ in order to extract $R_i(t_j)$, i.e. the tuples in $R_i$ that join with the parent tuple $t_j$. We can limit the number of tuples returned by this join to at most $l$ if we can safely predict that none of their descendants (if any) can be fruitful for the top-l. We say that a relation $R_i$ on the $G^{DS}$ is fruitful-l for a given largest-l if we can safely predict that only up to $l$ tuples from $R_i$, and none of their descendants (if any), can be fruitful for the top-l; this is the case when largest-l ≥ mmax($R_i$) but largest-l < max($R_i$). In other words, we can safely extract only up to $l$ tuples greater than largest-l from a fruitful-l relation; i.e. there is no need to compute the complete join. For instance, consider the example of Figure 7, where we are about to traverse the relation PaperCitedBy (a leaf node on the $G^{DS}$, thus a fruitful-l relation) in order to extract the joins with Paper tuple $p_2$. Then, we can extract from the database only up to $l$ tuples with local importance greater than largest-l (which is 0, since |top-l PQ| < $l$). Similarly, when traversing the fruitful-l relation PaperCites with largest-l=0.12, we extract up to $l$ tuples larger than largest-l. Note that the Paper relation is not fruitful-l, since largest-l=0 and mmax($R_{Paper}$)=7.38, thus largest-l < mmax($R_{Paper}$). As a consequence, we cannot apply this avoidance condition and hence we need to extract all tuples for Paper. Note that this condition has no impact on M:1 relationships, since the maximum cardinality of $R_i(t_j)$ is 1 anyway.

Algorithm 4 The Prelim-l OS Generation Algorithm

Prelim-l OS Generation ($l$, $t^{DS}$, $G^{DS}$)
1: largest-l = 0
2: add $t^{DS}$ as the root of the prelim-l
3: enQueue(Q, $t^{DS}$)
4: enQueue(top-l PQ, $t^{DS}$)
5: while !(IsEmptyQueue(Q)) do
6:    $t_j$ = deQueue(Q)
7:    for each child relation $R_i$ of R($t_j$) in $G^{DS}$ do
8:       if !(largest-l ≥ max($R_i$) AND largest-l ≥ mmax($R_i$)) then  ▷ Av. Cond. 1
9:          if (largest-l ≥ mmax($R_i$)) then
10:            $R_i(t_j)$ = "SELECT * TOP l FROM $R_i$ WHERE ($t_j$.ID = $R_i$.ID AND $R_i$.li > largest-l)"  ▷ Av. Cond. 2. $t_j$.ID and $R_i$.ID represent the keys on which $t_j$ and $R_i$ join, and $R_i$.li the local importance attribute of $R_i$
11:         else
12:            $R_i(t_j)$ = "SELECT * FROM $R_i$ WHERE ($t_j$.ID = $R_i$.ID)"
13:         for each tuple $t_i$ of $R_i(t_j)$ do
14:            add $t_i$ on prelim-l as child of $t_j$
15:            enQueue(Q, $t_i$)
16:            if (li($t_i$) > largest-l) then
17:               enQueue(top-l PQ, $t_i$)
18:               if (|top-l PQ| > $l$) then
19:                  deQueue(top-l PQ)
20:               if (|top-l PQ| < $l$) then
21:                  largest-l = 0
22:               else
23:                  largest-l = Smallest(top-l PQ)
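To make the interplay of the two avoidance conditions concrete, the following sketch runs the breadth-first generation on in-memory data rather than on a database. Everything about the interface is an assumption made for the illustration: the function name `prelim_l`, the `join(relation, parent_tuple)` callback that stands in for the SQL queries and returns `(tuple_id, local_importance)` pairs, and the toy relations and scores. The sketch also assumes that the limited query of Avoidance Condition 2 returns the $l$ highest-scoring qualifying tuples.

```python
import heapq
from collections import deque


def prelim_l(l, root, root_rel, root_li, gds_children, max_li, mmax, join):
    """Breadth-first prelim-l OS generation with both avoidance conditions.
    Returns the prelim-l OS as a dict: tuple id -> list of child tuple ids."""
    tree = {root: []}
    top = [root_li]          # min-heap with the current top-l importance values
    largest = 0.0            # largest-l; stays 0 until the heap holds l values
    queue = deque([(root, root_rel)])
    while queue:
        tj, rel = queue.popleft()
        for ri in gds_children.get(rel, []):
            if largest >= max_li[ri] and largest >= mmax[ri]:
                continue                         # Av. Cond. 1: fruitless sub-tree
            rows = join(ri, tj)
            if largest >= mmax[ri]:              # Av. Cond. 2: fruitful-l relation
                rows = heapq.nlargest(
                    l, (r for r in rows if r[1] > largest), key=lambda r: r[1])
            for ti, li in rows:
                tree[tj].append(ti)
                tree[ti] = []
                queue.append((ti, ri))
                if li > largest:
                    heapq.heappush(top, li)
                    if len(top) > l:
                        heapq.heappop(top)
                    largest = top[0] if len(top) == l else 0.0
    return tree


# Hypothetical data: one author, two papers, four citing tuples.
gds_children = {"Author": ["Paper"], "Paper": ["Cites"]}
max_li = {"Paper": 3.0, "Cites": 1.0}
mmax = {"Paper": 1.0, "Cites": 0.0}
data = {("Paper", "a1"): [("p1", 3.0), ("p2", 2.0)],
        ("Cites", "p1"): [("c1", 1.0), ("c2", 0.5), ("c3", 0.1)],
        ("Cites", "p2"): [("c4", 0.9)]}


def join(rel, t):
    return data.get((rel, t), [])


print(prelim_l(2, "a1", "Author", 5.0, gds_children, max_li, mmax, join))
# {'a1': ['p1', 'p2'], 'p1': [], 'p2': []}
```

With $l$=2 the Cites relation is never visited, because largest-l reaches 3.0 after the first paper and the sub-tree becomes fruitless. With $l$=4 the citations of p1 are fetched through the limited join, after which largest-l rises to 1.0 and the citations of p2 are skipped.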

In terms of cost, in the worst case we need up to $n$ I/O accesses (if operating directly on the database), where $n$ is the number of nodes in the complete OS, even if we extract only $j$ tuples (recall that Avoidance Condition 2 still requires an I/O access even when it returns no results). In practice, however, there can be significant savings if the top-l tuples are found early and large subtrees of the complete OS are pruned. The prelim-l OS created according to Definition 2 does not necessarily contain the optimal size-l OS; e.g. the prelim-5 OS of our example does not contain the $ca_{10}$ node, which belongs to the optimal size-5 OS. In practice, we found that in most cases the prelim-l OS did contain the optimal solution. This means that all size-l OS computation algorithms may give the same results when applied either on the prelim-l or on the complete OS. The following lemma proves that if monotonicity holds, then the prelim-l OS will certainly include the optimal size-l OS.

LEMMA 3. When the nodes of an OS have monotonically decreasing local importance scores with respect to their distance from the root, then the prelim-l OS contains the optimal size-l OS.

PROOF. When monotonicity holds, the optimal size-l OS is the top-l set (as shown by Lemma 2). Therefore, the prelim-l OS produced by this algorithm, which contains the top-l set, also contains the optimal size-l OS. □

Finally, we note that we have also investigated a variant of the prelim-l OS which includes the largest top-path-l nodes (rather than the top-l), namely the $l$ tuples with the largest $AI(p_i)$. However, this approach did not result in better time or approximation quality, so we do not discuss it further.

(a) The complete OS, the prelim-l OS and the top-l set. Nodes with low transparency are pruned tuples (e.g. pc-j, $ca_{10}$ etc.), shaded nodes are the top-l set (e.g. $a_1$, peg etc.) and the rest are the remaining tuples of the prelim-l OS (e.g. $p_2$, $p_3$ etc.)

Figure 7: The Prelim-l OS Generation Algorithm ($l$=5, $t^{DS}$=$a_1$,



6. EXPERIMENTAL EVALUATION 

In this section, we experimentally evaluate the proposed size-l OS concept and algorithms. We evaluate our algorithms using both complete and prelim-l OSs. First, the effectiveness of the proposed size-l OSs is thoroughly investigated with the help of human evaluators. Then, the quality of the size-l OSs produced by the greedy heuristics is compared to that of the corresponding optimal OSs. Finally, the efficiency of the algorithms is comparatively investigated.

We used two databases: DBLP² and TPC-H³ (we used scale factor 1 in generating the TPC-H dataset). The two databases have 2,959,511 and 8,661,245 tuples, occupying 319.4MB and 1.1GB on the disk, respectively.

We use ObjectRank (global) [3] and ValueRank [9] to calculate the global importance for the tuples of the DBLP and TPC-H databases, respectively. For a more thorough evaluation, we investigate scores produced by various settings that have been studied in [3], namely two $G^A$s and three values of $d$. The $G^A$s are: (1) the $G^{A_1}$s (default), presented in Figure 13, and (2) the $G^{A_2}$, which for DBLP has common transfer rates (0.3) for all edges and for TPC-H neglects values (i.e. becomes an ObjectRank $G^A$). The values of $d$ are $d_1$=0.85 (default), $d_2$=0.10 and $d_3$=0.99. We use Equation 1 to calculate affinity (alternatively, an expert can define $G^{DS}$s and affinity manually, i.e. select which relations to include in each $G^{DS}$ and their affinity). For the experiments, we used Java, MySQL, cold cache and a PC with an AMD Phenom 9650 2.3GHz (Quad-Core) processor and 4GB of memory.

² http://www.informatik.uni-trier.de/~ley/db/
³ http://www.tpc.org/tpch/

6.1 Effectiveness 

We used human evaluators to measure effectiveness. First, we familiarized them with the concepts of OSs in general and size-l OSs in particular. Specifically, we explained that a good size-l OS should be a stand-alone and meaningful synopsis of the most important information about the particular DS. Then, we provided them with OSs and asked them to size-l them for $l$ = 5, 10, 15, 20, 25, 30. None of our evaluators were involved in this paper. Figure 8 measures the effectiveness of our approach as the average percentage of the tuples that exist both in the evaluators' size-l OSs and in the size-l OS computed by our methods. This measure corresponds to recall and precision at the same time, as both of the OSs compared have a common size.
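The measure can be written down in a few lines. The sketch below is only an illustration of the definition above, with hypothetical function names and made-up tuple IDs; because both summaries have the same size $l$, the overlap divided by $l$ is recall and precision at once.

```python
def effectiveness(evaluator_os, computed_os):
    """Percentage of tuples shared by two size-l OSs of the same size."""
    a, b = set(evaluator_os), set(computed_os)
    if len(a) != len(b):
        raise ValueError("both size-l OSs must contain the same number of tuples")
    return 100.0 * len(a & b) / len(a)


def average_effectiveness(pairs):
    """Average effectiveness over (evaluator OS, computed OS) pairs."""
    return sum(effectiveness(e, c) for e, c in pairs) / len(pairs)


# Hypothetical tuple IDs: three of the five tuples agree.
print(effectiveness([1, 5, 6, 12, 14], [1, 5, 6, 11, 13]))   # 60.0
```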

DBLP. Since the DBLP database includes data about real people and their papers, we asked the DSs themselves (i.e. eleven authors listed in DBLP) to suggest their own Author and Paper size-l OSs. The rationale of this evaluation is that the DSs themselves have the best knowledge of their work and can therefore provide accurate summaries. Figures 8(a) and (b) plot the recall of the optimal size-l OS for various ObjectRank settings. In general, ObjectRank scores produced with $G^{A_1}$-$d_1$ and $G^{A_1}$-$d_3$ are good options for generating Author and Paper size-l OSs (as these settings produce similar ObjectRank scores) and always dominate on larger values of $l$. More precisely, for $G^{A_1}$-$d_1$, effectiveness ranges from 75% to 90% for $l$=10 to 30, and from 40% to 60% for $l$=5. These results are very encouraging. User evaluation also revealed that the inter-relational ranking properties (e.g. whether paper $p_1$ is more important than author $a_1$) crucially affect the quality of the size-l OSs. For instance, on Author OSs, evaluators first selected important Paper tuples to include in the size-l OS and then additional tuples such as co-authors, year and conferences (these were usually included in summaries of larger sizes, i.e. $l$>10). The bias to select Papers (i.e., 1st-level neighbors) is favored by setting $G^{A_1}$-$d_2$, although overall this setting was not very effective; e.g., in Figure 8(a), this setting achieves 73.3% (in comparison to 60% for $G^{A_1}$-$d_1$) for $l$=5.

The impact on effectiveness of the approximated size-l OSs produced by our greedy algorithms is very minor. For instance, using scores produced by the default setting (i.e. $G^{A_1}$ and $d_1$=0.85) on the Author $G^{DS}$, the Update Top-Path-l algorithm generates summaries of the same effectiveness as the optimal, whereas Bottom-Up has a very minor additional loss ranging from 2% to 10%. On the Paper $G^{DS}$, all approaches give the same effectiveness, as they all return the optimal size-l OSs. The use of prelim-l OSs had no impact on effectiveness. As we show later, prelim-l OSs have a very minor impact on approximation quality, which did not affect effectiveness.

TPC-H. We presented 16 random OSs to eight evaluators and asked them to size-l them. The evaluators were professors and researchers from Manchester and Hong Kong Universities. In addition, for each OS and tuple, a set of descriptive details and statistics was also provided. For instance, for a customer, the total number, size and value of orders and the corresponding minimum, median and maximum values of all customers were provided (similarly to the evaluation in [9]). The provision of such details gave the evaluators a better knowledge of the database.

In summary, the $G^{A_1}$ (for any $d$) is a safe option, as it produces good size-l OSs on both Customer and Supplier OSs (Figures 8(c) and 8(d)); e.g. effectiveness results for $G^{A_1}$-$d_1$ range from 60% to 78%. On the other hand, $G^{A_2}$, which is the ObjectRank version of the $G^{A_1}$, did not satisfy the evaluators as much on Supplier OSs. Interestingly, we observe that the effectiveness results for size 5 were very good on both OSs due to good inter-relational ranking.

(a) DBLP Author (Optimal size-l OS)   (b) DBLP Paper (Optimal size-l OS)   (c) TPC-H Customer (Optimal size-l OS)   (d) TPC-H Supplier (Optimal size-l OS)

Figure 8: Effectiveness (i.e. Recall=Precision)

Comparative Evaluation. We compared our results with Google Desktop (a text document search engine). We store each OS as an HTML file and then issue the corresponding query using Google Desktop in order to obtain its snippet. Google snippets contain a small number of words from the beginning of the file, combining static text such as "Search for Christos Faloutsos in the DBLP Database" and the first few tuples (up to three) from the OS (note that the order of nodes in an OS is random). We make a less austere comparison by counting the selected tuples that belong to the corresponding size-5 OS proposed by our evaluators (since Google snippets contain only up to three tuples). As expected, in all cases Google snippets found zero or, exceptionally, one tuple from the corresponding size-5 OS. Detailed results are not shown due to space constraints.

6.2 Approximation Quality 

We now compare the importance of the size-l OSs produced by the greedy methods against the optimal ones. More precisely, the results of Figure 9 represent the approximation quality, namely the ratio of the achieved size-l OS importance to the optimal importance. We present the average results for 10 random OSs per $G^{DS}$. The average size (i.e. the number of tuples) of the OSs is also indicated (denoted as Aver(|OS|)).
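For concreteness, the ratio can be computed as below. This is a small sketch of the definition only; the function name, the `weight` map of local importance scores and the node sets are hypothetical and do not come from the experiments.

```python
def approximation_quality(weight, achieved, optimal):
    """Ratio of the importance of an achieved size-l OS to that of the
    optimal size-l OS (1.0 means the greedy result is optimal)."""
    return sum(weight[v] for v in achieved) / sum(weight[v] for v in optimal)


# Hypothetical local importance scores and size-2 summaries.
weight = {1: 10, 2: 5, 3: 1, 4: 20}
quality = approximation_quality(weight, achieved={1, 3}, optimal={1, 2})
print(round(100 * quality, 1))   # 73.3
```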

Figures 9(a)-(e) show the approximation quality produced by the default settings (i.e. $G^{A_1}$ and $d_1$=0.85). The results show that Update Top-Path-l is always better than the Bottom-Up Pruning algorithm. In general, the superiority of Update Top-Path-l over Bottom-Up Pruning is up to 10% (excluding Paper OSs, where all methods achieved 100%). The evaluation also reveals that top-l prelim-l OSs have a very low approximation quality loss. They have no impact on the Bottom-Up algorithm and an impact of only up to 4% on the Update Top-Path-l algorithm. Another observation is that the contents of the $G^{DS}$ and the values of the local importance scores also have a significant impact. For instance, for Paper OSs all methods achieved 100% quality. This is because the monotonicity property holds (Lemma 2); the Paper $G^{DS}$ is Paper → (Author, PaperCitedBy, PaperCites, Year → (Conference)) and the local importance of Conferences is always smaller than that of the corresponding Years. In general, inter-relational and intra-relational ranking of tuples have an impact as well. For instance, Figure 9(f) summarizes the average approximation quality for Author OSs with global importance scores produced by the various settings (where inter- and intra-relational scores vary). The experimental results also reveal that the smaller the OS is in comparison to $l$, the more accurate our algorithms are. For example, the particular Author OS of Figure 9(e) with |OS|=67 yields 100% approximation quality from all algorithms by $l$=25.

Figure 9: Approximation Quality

6.3 Efficiency 

We compare the run-time performance of our algorithms in Figure 10. We used the same OSs as in Section 6.2 (i.e. the same 10 OSs per $G^{DS}$) and used the default settings for generating the global importance of tuples (alternative settings do not have any impact on the performance). Figures 10(a)-(e) show the costs of our algorithms for computing size-l OSs from OSs of various sizes and for different $l$ values, excluding the time required to generate the OS on which each algorithm operates. Figures 10(a)-(d) show the costs of OSs from various $G^{DS}$s, while Figure 10(e) shows scalability for Author OSs of different sizes and a common $l$=10 (analogous results were obtained from all $G^{DS}$s, but we omit them due to space limitations). Note that the y-axes (time) in all graphs are split into two parts, one linear (bottom) and one exponential (top), in order to show how the expensive DP scales and at the same time keep the differences between the other methods visible.

Figure 10: Efficiency

As expected, the OS size and $l$ significantly affect the cost (the bigger the OS or $l$ is, the more time is required). The cost of DP becomes unbearable for moderate to large OSs and values of $l$ (we had to stop the algorithms after 30 min. of running). Bottom-Up Pruning is consistently faster than Update Top-Path-l, as it requires fewer operations. An interesting observation is that Bottom-Up Pruning on the complete OS becomes faster as $l$ increases, because $n-l$ drops and fewer de-heaping operations are needed.

Figure 10(f) breaks down the cost to OS generation (bottom of 
the bar) and size-/ computation (top of the bar) for each method. 
We investigated two approaches for generating the OSs; the first 
employs an in-memory data-graph and the second computes the 
OS directly from the database. The OSs are generated much faster 
using the data graph; thus, we present only the data-graph based 
results in Figure 10(f). For example, to generate the Supplier OS- 
s (that have the largest sizes among all tested OSs) only 0.2 sec. 
are required using the data-graph, compared to 12.9 sec. directly 
from the database. The DBLP and TPC-H data-graphs take only 
17 sec. and 128 sec. to generate and occupy 150MB and 500MB, 



respectively. More precisely, our data-graph nodes correspond to 
the database tuples and edges to tuples relationships (through their 
primary and foreign keys). Note that the data-graph is only an in- 
dex and does not contain actual data as nodes capture only keys 
and global importance. Figure 10(f) also shows the average sizes
of the complete OSs (1,341) and the prelim-l OSs (134 and 259 for
l = 10 and l = 50, respectively). The prelim-l OS generation is
always faster than that of the complete OS; for instance the prelim-5
OS's size is approximately 10% of the size of the complete OS and
its generation can be done up to 2.5 times faster. The savings are
not proportional, because there can be many accesses to fruitless
relations during the prelim-l OS generation (i.e. Avoidance
Condition 2 still requires access to relations even when it returns
no results). Thus, prelim-l OSs further reduce the time required by
our algorithms: Bottom-Up Pruning becomes on average up to 5.7
times faster, whereas Update Top-Path-l becomes up to 4.1 times
faster. Note that the size of the database does not impact the OS
generation time, because hash-maps are used to look up the required
nodes of an OS; we omit experimental results due to space constraints.
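
The role of the data-graph as a key-only index can be sketched as follows. This is a minimal illustration, assuming that each node is identified by a (relation, primary key) pair and stores only its global importance, and that neighbors are grouped per relation in hash-maps; the function name, the layout and the toy tuples are hypothetical and not taken from the system evaluated here.

```python
from collections import defaultdict

def build_data_graph(tuples, links):
    """Build a key-only index over database tuples (illustrative layout).

    tuples: iterable of (relation, primary_key, global_importance)
    links:  iterable of ((rel_a, key_a), (rel_b, key_b)) pairs, one per
            primary/foreign-key relationship between two tuples
    """
    importance = {}  # node id -> global importance (no attribute data)
    # node id -> neighbor relation -> list of neighbor node ids
    neighbors = defaultdict(lambda: defaultdict(list))
    for relation, key, score in tuples:
        importance[(relation, key)] = score
    for a, b in links:
        neighbors[a][b[0]].append(b)  # edges are stored in both directions
        neighbors[b][a[0]].append(a)
    return importance, neighbors

# Toy data: one author linked to two papers (hypothetical keys and scores).
tuples = [("Author", 1, 0.9), ("Paper", 10, 0.7), ("Paper", 11, 0.4)]
links = [(("Author", 1), ("Paper", 10)), (("Author", 1), ("Paper", 11))]
importance, neighbors = build_data_graph(tuples, links)

# Hash-map look-up of an author's papers; cost does not depend on table size.
print(neighbors[("Author", 1)]["Paper"])
```

Because each look-up is a hash-map access keyed by tuple identifier, the cost of collecting the nodes of one OS depends on the size of that OS rather than on the size of the database.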

Discussion. In summary, the DP algorithm is not practical on
large OSs and large values of l, whereas our greedy algorithms are
very fast and, as we showed in Section 6.2, their results are of high
approximation quality. Note that, in this paper, our main focus has
been on optimizing the size-l OS generation, not the OS generation
cost (which we leave as future work). In addition, the use of prelim-l
OSs is consistently a better choice than the complete OSs, since they
are always faster with a very minor quality loss. If we need to find
the size-l OS at a high speed, then Bottom-Up Pruning is a good
choice, since it is consistently the fastest method (e.g. using the
prelim-50 OS for Supplier costs 0.12+0.12=0.24 sec). If the OS has to
be generated from the database, then the Update Top-Path-l algorithm
is preferable, as it gives better quality and is only insignificantly
more expensive (e.g. 8.08+0.32 sec).

7. CONCLUSION AND FUTURE WORK 

We investigated the effective and efficient generation of size-l
OSs. First, we gave a formal definition of the size-l OS, which
targets the synoptic and stand-alone presentation of a large OS.
We proposed a dynamic programming algorithm and two efficient
greedy heuristics for producing size-l OSs. In addition, we
proposed a preprocessing strategy that avoids generating the complete
OS before producing size-l OSs. A systematic experimental
evaluation conducted on the DBLP and TPC-H databases verifies the
effectiveness, approximation quality and efficiency of our techniques.

A direction of future work concerns the further exploration of
algorithms using hashing and reachability indexing techniques [18].
Another challenging problem is the combined size-l and top-k
ranking of OSs. In addition, the selection of an appropriate value
for l is an interesting problem; a natural approach is to select l
based on the number of attributes or words that will result, e.g. 20
attributes or 50 words. However, this approach leads to a
reformulation of the problem and we plan to investigate it. Finally,
it is observed that, in the general case, optimal size-l OSs for
different l could be very different. This prevents the incremental
computation of a size-l OS from the optimal size-(l-1) OS, limiting
pre-computation or caching approaches that could accelerate
computation. In the future, we plan to experimentally analyze the
space of optimal size-l OSs and identify potential similarities among
them that could assist their pre-computation and compression.
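
The observation that optimal size-l OSs need not be nested can be reproduced on a toy tree. The sketch below is a brute-force illustration only; it assumes that a size-l OS is a connected set of l tuples containing the root and that its importance is the sum of its tuples' scores. The function name and the scores are hypothetical.

```python
from itertools import combinations

def optimal_size_l(parent, importance, l):
    """Exhaustively find the best l-node subtree that contains the root.

    parent:     dict node -> parent node (the root maps to None)
    importance: dict node -> importance score
    """
    root = next(n for n, p in parent.items() if p is None)
    others = [n for n in parent if n != root]
    best, best_score = None, float("-inf")
    for combo in combinations(others, l - 1):
        chosen = set(combo) | {root}
        # Connected iff every chosen non-root node has its parent chosen too.
        if all(parent[n] in chosen for n in combo):
            score = sum(importance[n] for n in chosen)
            if score > best_score:
                best, best_score = chosen, score
    return best, best_score

# Hypothetical OS: "c" is very important but reachable only through "b".
parent = {"r": None, "a": "r", "b": "r", "c": "b"}
importance = {"r": 1, "a": 5, "b": 1, "c": 10}

print(optimal_size_l(parent, importance, 2))
print(optimal_size_l(parent, importance, 3))
```

Here the optimal size-2 OS is {r, a} (score 6), whereas the optimal size-3 OS is {r, b, c} (score 12); the latter does not contain a, so it cannot be obtained by extending the former with one tuple.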
APPENDIX



Algorithm 5 The OS Generation Algorithm

OS Generation (t^DS, G^DS)
add t^DS as the root of the OS
enQueue(Q, t^DS)    ▷ Queue Q facilitates breadth-first traversal
while !(isEmptyQueue(Q)) do
    t_j = deQueue(Q)
    for each child relation R_i of R(t_j) in G^DS do
        R_i(t_j) = "SELECT * FROM R_i WHERE (t_j.ID = R_i.ID)"
        for each tuple t_i of R_i(t_j) do
            add t_i to the OS as a child of t_j
            enQueue(Q, t_i)
return OS
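
The breadth-first traversal of Algorithm 5 can be sketched in Python against an in-memory SQLite database. This is a minimal illustration under stated assumptions: the G^DS is encoded as nested dictionaries, every relation has an `id` primary key, and each child relation names the foreign-key column that joins it to its parent. The schema, the column names and the function name are hypothetical.

```python
import sqlite3
from collections import deque

def generate_os(conn, g_root, root_id):
    """Breadth-first OS generation over a tree-shaped G^DS (illustrative)."""
    os_root = {"relation": g_root["relation"], "id": root_id, "children": []}
    queue = deque([(os_root, g_root)])  # pairs of (OS tuple, its G^DS node)
    while queue:
        t_j, g_node = queue.popleft()
        for g_child in g_node["children"]:
            rel, fk = g_child["relation"], g_child["fk"]
            # Identifiers come from the trusted G^DS; the key value is bound.
            sql = f"SELECT id FROM {rel} WHERE {fk} = ? ORDER BY id"
            for (child_id,) in conn.execute(sql, (t_j["id"],)).fetchall():
                t_i = {"relation": rel, "id": child_id, "children": []}
                t_j["children"].append(t_i)   # add t_i as a child of t_j
                queue.append((t_i, g_child))  # enqueue the child, not t_j
    return os_root

# Toy schema and data (assumed for this example only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE paper(id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    CREATE TABLE citation(id INTEGER PRIMARY KEY, paper_id INTEGER);
    INSERT INTO author VALUES (1, 'A');
    INSERT INTO paper VALUES (10, 1, 'P1');
    INSERT INTO paper VALUES (11, 1, 'P2');
    INSERT INTO citation VALUES (100, 10);
""")

g_ds = {"relation": "author", "fk": None, "children": [
    {"relation": "paper", "fk": "author_id", "children": [
        {"relation": "citation", "fk": "paper_id", "children": []}]}]}

os_tree = generate_os(conn, g_ds, 1)
print([child["id"] for child in os_tree["children"]])
```

Each queue entry carries the G^DS node of its tuple, so a relation that appears at several positions of the G^DS is expanded according to its position rather than its name.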

Figure 11: The TPC-H Database Schema

Figure 12: The TPC-H Customers G^DS (Annotated with (Affinity), max(R_i) and mmax(R_i))

Figure 13: The G^As for the DBLP and TPC-H Databases